Table 1: Subset of the LDA hyperparameter grid-search results for word-tokenized speeches.

| Corpus | Topics | Alpha | Beta | Coherence |
|---|---|---|---|---|
| BoW | 9 | 0.60 | 0.10 | -0.42 |
| BoW | 9 | 0.60 | 0.30 | -0.42 |
| BoW | 9 | 0.60 | 0.50 | -0.42 |
| BoW | 9 | 0.60 | 0.70 | -0.42 |
| BoW | 9 | 0.60 | 0.90 | -0.42 |
| tf-idf | 7 | 0.20 | 0.10 | -4.25 |
| tf-idf | 7 | 0.20 | 0.30 | -4.25 |
| tf-idf | 7 | 0.20 | 0.50 | -4.25 |
| tf-idf | 7 | 0.20 | 0.70 | -4.25 |
| tf-idf | 7 | 0.20 | 0.90 | -4.25 |
Sentiments and Topics in South African SONA Speeches
Abstract
In the domain of natural language processing (NLP), a descriptive text analysis was conducted on the State of the Nation Addresses (SONAs) by South African presidents from 1994 to 2023, employing emotion-and-theme extraction techniques. Sentiment analysis, leveraging two lexicons (\(\texttt{AFINN}\) and \(\texttt{bing}\)), was applied to gauge the polarity of emotions within the speeches. Concurrently, five topic models were applied, namely Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA), Correlated Topic Model (CTM), and Author-Topic Model (ATM), to track thematic patterns.
No overt trajectory differences were found between the two lexicons in terms of positive and negative sentiment, across both time periods and presidents. It was further observed that the specific words contributing to the sentiment scores varied across the speeches, yet no discernible pattern emerged when stratified by president. In terms of thematic consistency, LSA, pLSA, LDA, and CTM exhibited uniformity in identifying general governance-related topics in different spheres (economic, social, structural) and other contemporary contexts (pandemic and sport). In contrast, the application of the ATM revealed a nuanced differentiation, unearthing more abstract themes aligned with personal presidential agendas.
Introduction
The field of Natural Language Processing (NLP) encompasses techniques tailored for theme tracking and opinion mining, both of which form part of text analysis. Of particular prominence is the extraction of latent thematic patterns and the establishment of the extent of emotionality expressed in political texts.
Given such political context, it is of specific interest to analyse the annual State of the Nation Address (SONA) speeches delivered by six different South African presidents (F.W. de Klerk, N.R. Mandela, T.M. Mbeki, K.P. Motlanthe, J.G. Zuma, and M.C. Ramaphosa) ranging over twenty-nine years (from 1994 to 2023). This analysis, descriptive and data-driven in nature, endeavours to examine the content of the SONA speeches in terms of themes via topic modelling (TM) and emotions via sentiment analysis (SentA). Applying a double-bifurcated approach, SentA will be executed within a macro and micro context both at the text (all-presidents versus by-president SONA speeches, respectively) and token (sentences versus words, respectively) level, as shown in Figure 1. This underlying framework is also assumed for TM, with the exceptions of only employing it within a macro-context at text level and a micro-context at the token level, as seen in Figure 2.
Through such a multi-layered lens, the identification of any trends, both in terms of topics and sentiments, over time at both a large (presidents as a collective) as well as at a small (each president as an individual) scale is attainable. This explicates not only an aggregated perspective of the general political discourse prevailing within South Africa (SA), but also a more niche outlook of the specific rhetoric employed by each of the country’s serving presidents during different date periods.
To achieve all of the above-mentioned, it is first relevant to revise foundational terms and review related literature in the context of politics and NLP. All pertinent pre-processing of the political text data is then considered, followed by a discussion delving into the details of each SentA and TM approach applied. Specifically, two different lexicons are leveraged to describe sentiments, whilst five different topic models are tackled to uncover themes within the South African presidents’ SONA speeches. Following the implementation of these methodologies, the results thereof are detailed in terms of insights and interpretations. Thereafter, an overall evaluation of the techniques in terms of efficacy and inadequacy is overviewed. Finally, focal findings are highlighted and potential improvements as part of future research are recommended.
Literature Review
SONA
SONA, a pivotal event in the political programme of Parliament, serves as a presidential summary for the South African public. Specifically, the country’s current domestic affairs and international relations are reflected upon, past governmental work is perused, and future plans in terms of policies and civil projects are proposed. Through this address, accountability on the part of government is re-instilled and transparency with the public is re-affirmed on an annual basis, either once (non-election year) or twice (pre- and post-election) (Minister Faith Muthambi 2017). The text analysis of such SONA speeches, via the implementation of TM and SentA, has previously been done for Philippine presidents (Miranda and Bringula 2021). It is now of interest to extend such an application to another country, SA.
Topic modelling (TM)
TM, an unsupervised learning approach, involves the identification of underlying abstract themes in some body of text, in the absence of pre-specified labels (Cho 2019). In general, there are two topic-model assumptions: each document comprises a mixture of topics, and each topic consists of a collection of words (Zhang 2018). Different types of topic models exist, each with varying complexity in terms of the way in which topics are generated. The simplest one, Latent Semantic Analysis (LSA), has previously been implemented to discover patterns of lexical cohesion in political speech, specifically that of the former Prime Minister of the United Kingdom, Margaret Thatcher (Klebanov, Diermeier, and Beigman 2008). Improving on LSA methodology, Probabilistic LSA (pLSA) has been implemented in healthcare (Zhu 2014) and educational (Ming et al. 2014) contexts, albeit no application thereof in political science was found. A further sophisticated model, Latent Dirichlet Allocation (LDA), has been used to determine trending topics in news on governmental YouTube channels (Subhan et al. 2023).
Sentiment analysis (SentA)
SentA involves deciphering the intent of words to infer certain emotional dimensions, labelled either in polarized terms (negative/positive) or in higher-dimensional terms (niche feelings like joy/sadness). Various unigram lexicons have been derived to this end. For example, the R-based \(\texttt{nrc}\) lexicon dichotomously classifies words with yes/no labels in categories such as positive, negative, anticipation, anger, and so forth. In contrast, the Python-based \(\texttt{TextBlob}\) lexicon processes textual data into a tuple comprising a polarity score (ranging between -1 and +1, relating to negative and positive sentiment, respectively) and a subjectivity score (ranging between 0 and 1, referring to being very objective or very subjective, respectively). Such pre-defined lexicons have previously been utilized to analyse political communication via SentA, specifically in terms of campaign polarization (Haselmayer and Jenny 2017).
Data
Tokenization
The process of tokenization entails breaking up given text into units, referred to as tokens (or terms), which are meaningful for analysis (Zhang 2018). In this case, these tokens take on different structures, based on either a macro-context (i.e., sentences) or micro-context (i.e., words). At both scales, the way in which these tokens are valued is varied: the value is defined by either a bag-of-words (BoW) or a term-frequency, inverse-document-frequency (tf-idf) approach. The former involves counting the number of occurrences of some token in some document. The latter not only regards the frequency of some token, but also the significance thereof: tf-idf assigns a weight to each token in a document which reflects its importance relative to the entire collection of documents (corpus). The tf-idf value of a token t in a document d within a corpus D is calculated as the product of two constituents. The first, tf(t, d), is the frequency of token t in document d divided by the total number of tokens in document d; the second, idf(t, D), is the natural logarithm of the total number of documents in corpus D divided by the number of documents containing the token t (Silge and Robinson 2017).
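The tf-idf calculation above can be sketched in a few lines (a minimal illustration with a toy three-document corpus; the actual analysis presumably relies on a text-mining library):

```python
import math

def tf_idf(token, doc, corpus):
    """tf-idf of `token` in `doc`, relative to `corpus` (a list of token lists).

    tf(t, d)  = count of t in d / total tokens in d
    idf(t, D) = ln(|D| / number of documents containing t)
    """
    tf = doc.count(token) / len(doc)
    n_containing = sum(1 for d in corpus if token in d)
    idf = math.log(len(corpus) / n_containing)
    return tf * idf

corpus = [["economy", "growth", "economy"],
          ["growth", "jobs"],
          ["economy", "jobs", "water"]]
# "water" appears in only 1 of 3 documents, so it carries a non-zero weight:
score = tf_idf("water", corpus[2], corpus)  # (1/3) * ln(3) ≈ 0.3662
```

A token appearing in every document gets idf = ln(1) = 0, so corpus-wide filler words are down-weighted automatically.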
Number of topics
In order to determine the optimal number of topics, a coherence score is calculated. This metric measures the extent to which a topic model produces topics that are semantically interpretable by humans, rather than mere artifacts of statistical inference. Hence, the number of topics, as well as any other topic-model hyperparameters (like \(\alpha\) and \(\beta\) for LDA), are tuned to values that yield the maximum coherence score, allowing for the most understandable themes.
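The analysis does not state which coherence variant was used; as one common choice, the UMass coherence score can be sketched as follows (the corpus of word sets below is purely illustrative):

```python
import math

def umass_coherence(topic_words, corpus):
    """UMass coherence for one topic: over every ordered pair of top words,
    sum log((co-document frequency + 1) / document frequency of the
    conditioning word). Higher (closer to zero) means more coherent."""
    def doc_freq(*words):
        return sum(1 for doc in corpus if all(w in doc for w in words))
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            score += math.log(
                (doc_freq(topic_words[i], topic_words[j]) + 1)
                / doc_freq(topic_words[j]))
    return score

docs = [{"economy", "growth"}, {"economy", "jobs"},
        {"growth", "jobs"}, {"water"}]
```

Words that always co-occur contribute log(1) = 0; words that never co-occur contribute negative terms, dragging the score down.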
Data pre-processing
Pre-processing procedures are paramount to prepare and refine the data for subsequent analytical scrutiny. Initially, a process of data reading and parsing is executed: via string operations, crucial information such as the year and the president’s name is discerned from the file names and integrated as discrete columns within the dataframe to be analysed. Text-cleaning then follows, involving the removal of URLs, HTML entities, and newline characters from the SONA speeches, and employing periods and question marks as delimiters to fragment the text into sentences. This stage further includes the flattening of nested lists and the exclusion of null strings. A lemmatization function, tailored to convert words into their base (root) form using appropriate part-of-speech tags, is then defined and applied to the refined data. This phase also involves the exclusion of stop words and a custom list of additional words (e.g., greeting references like “honourable”, “member”, “madame”, “speaker”), with a stringent focus on retaining only those words recognized as English (given that the SONA speeches at times include one of the other official African languages, like Zulu).
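A minimal sketch of this cleaning pipeline is given below. It covers the URL/HTML/newline stripping, sentence splitting, and stop-word removal steps, but omits lemmatization, part-of-speech tagging, and the English-word filter; the stop-word lists here are small stubs, not the full lists used in the analysis:

```python
import re

STOP_WORDS = {"the", "and", "of", "to", "a", "in"}            # stub; full list assumed
CUSTOM_STOP = {"honourable", "member", "madame", "speaker"}   # greeting references

def clean_and_tokenise(speech):
    """Clean one speech, then return (sentence tokens, word tokens)."""
    speech = re.sub(r"https?://\S+", " ", speech)   # strip URLs
    speech = re.sub(r"&[a-z]+;", " ", speech)       # strip HTML entities
    speech = speech.replace("\n", " ")              # strip newline characters
    # Periods and question marks delimit sentences; drop null strings.
    sentences = [s.strip() for s in re.split(r"[.?]", speech) if s.strip()]
    words = [w for s in sentences for w in re.findall(r"[a-z']+", s.lower())
             if w not in STOP_WORDS | CUSTOM_STOP]
    return sentences, words

sentences, words = clean_and_tokenise(
    "Madame Speaker, the economy is growing. "
    "Visit https://example.com now.\nThank you?")
```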
Methods
Topic modelling
Latent Semantic Analysis (LSA)
LSA (Deerwester et al. 1990) is a non-probabilistic, non-generative model where a form of matrix factorization is utilized to uncover a few latent topics, capturing meaningful relationships among documents/tokens. As depicted in Figure 3, in the first step, a document-term matrix DTM is generated from the raw text data by tokenizing d documents into w words (or sentences), forming the columns and rows respectively. Each row-column entry is valued via either the BoW or tf-idf approach. This DTM, which is often sparse and high-dimensional, is then decomposed via a dimensionality-reduction technique, namely truncated Singular Value Decomposition (SVD). Consequently, in the second step, the DTM becomes the product of three matrices: the topic-word matrix \(A_{t*}\) (for the tokens), the topic-prevalence matrix \(B_{t*}\) (for the latent semantic factors), and the transposed document-topic matrix \(C^{T}_{t*}\) (for the documents). Here, t*, the optimal number of topics, is a hyperparameter which is refined (via the coherence-measure approach) to a value that retains the most significant dimensions in the transformed space. In the final step, the text data is then encoded using this optimal topic number.
Given that LSA involves only a DTM, its implementation is generally efficient. However, the truncated SVD step can be computationally intensive and does not readily support quick updates when new text data arrives. Additional LSA drawbacks include: a lack of interpretability, the underlying linear-model framework (which results in poor performance on text data with non-linear dependencies), and the underlying Gaussian assumption for tokens in documents (which may not be an appropriate distribution).
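The LSA decomposition described above can be sketched with a plain SVD on a toy document-term matrix (an illustration of the technique, not the paper's implementation):

```python
import numpy as np

def lsa(dtm, n_topics):
    """Truncated SVD of a document-term matrix: DTM ≈ U_k Σ_k V_k^T.
    Rows of V_k give topic-word loadings; rows of U_k Σ_k give
    document-topic coordinates."""
    U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
    topic_word = Vt[:n_topics]                  # (n_topics, n_terms)
    doc_topic = U[:, :n_topics] * s[:n_topics]  # (n_docs, n_topics)
    return doc_topic, topic_word

# Toy DTM: three documents over four terms (BoW counts)
dtm = np.array([[2., 1., 0., 0.],
                [1., 2., 0., 0.],
                [0., 0., 1., 2.]])
doc_topic, topic_word = lsa(dtm, n_topics=2)
```

Truncating to two topics collapses the first two documents onto a shared latent dimension; keeping all singular values reconstructs the DTM exactly.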
Probabilistic Latent Semantic Analysis (pLSA)
Instead of implementing truncated SVD, pLSA (Hofmann 1999) utilizes a generative, probabilistic model. Within this framework, a document d is first selected with probability P(d). Then, a latent topic t present in this selected document d is chosen with probability P(t|d). Finally, given this chosen topic t, a word w (or sentence) is generated from it with probability P(w|t), as shown in Figure 4. It is noted that the values of P(d) are determined directly from the corpus D, which is defined in terms of a DTM matrix. In contrast, the probabilities P(t|d) and P(w|t) are parameters modelled as multinomial distributions and iteratively updated via the Expectation-Maximization (EM) algorithm. Direct parallelism between LSA and pLSA can be drawn via the methods’ parameterization, as conveyed via matching colours of the topic-word matrix and P(w|t), the document-topic matrix and P(d|t), as well as the topic-prevalence matrix and P(t), displayed in Figure 3 and Figure 4, respectively.
Despite pLSA implicitly addressing LSA-related disadvantages, the method still has two main drawbacks. There is no probability model for the document-topic probabilities P(t|d), so topic mixtures cannot be assigned to new, unseen documents. The number of model parameters also grows linearly with the number of documents, making the method more susceptible to overfitting.
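The EM updates for pLSA can be sketched as follows (a compact illustration on a small dense document-term matrix; real implementations exploit sparsity):

```python
import numpy as np

rng = np.random.default_rng(0)

def plsa_em(dtm, n_topics, n_iter=50):
    """Fit pLSA by EM: alternately compute the topic responsibilities
    P(t|d,w) (E-step) and re-estimate the multinomial parameters
    P(w|t) and P(t|d) from expected counts (M-step)."""
    n_docs, n_words = dtm.shape
    p_w_t = rng.dirichlet(np.ones(n_words), size=n_topics)  # P(w|t)
    p_t_d = rng.dirichlet(np.ones(n_topics), size=n_docs)   # P(t|d)
    for _ in range(n_iter):
        # E-step: responsibility of each topic for each (doc, word) cell
        joint = p_t_d[:, :, None] * p_w_t[None, :, :]       # shape (d, t, w)
        p_t_dw = joint / joint.sum(axis=1, keepdims=True)
        # M-step: expected counts, then renormalize each distribution
        counts = dtm[:, None, :] * p_t_dw
        p_w_t = counts.sum(axis=0)
        p_w_t /= p_w_t.sum(axis=1, keepdims=True)
        p_t_d = counts.sum(axis=2)
        p_t_d /= p_t_d.sum(axis=1, keepdims=True)
    return p_t_d, p_w_t

dtm = np.array([[3., 0., 1.],
                [0., 2., 2.]])
p_t_d, p_w_t = plsa_em(dtm, n_topics=2)
```

Note how P(t|d) is a per-document parameter table, which is exactly why the parameter count grows with the corpus.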
Latent Dirichlet Allocation (LDA)
LDA is another generative, probabilistic model, which can be deemed a hierarchical Bayesian version of pLSA. By explicitly defining a generative model for the document-topic probabilities, both of the above-mentioned pitfalls of pLSA are improved upon: the number of parameters to estimate drastically decreases, and the ability to apply and generalize to new, unseen documents becomes attainable. As presented in Figure 5, the initial steps involve randomly sampling a document-topic probability distribution \(\theta\) from a Dirichlet (Dir) distribution with parameter \(\eta\), followed by randomly sampling a topic-word probability distribution \(\phi\) from another Dirichlet distribution with parameter \(\tau\). From the \(\theta\) distribution, a topic t is selected by drawing from a multinomial (Mult) distribution (third step), and from the \(\phi\) distribution given said topic t, a word w (or sentence) is sampled from another multinomial distribution (fourth step). The associated LDA parameters are then estimated via a variational expectation-maximization algorithm or collapsed Gibbs sampling.
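One of the two estimation routes mentioned, collapsed Gibbs sampling, can be sketched compactly (an illustration on toy documents; each word's topic is resampled with weight proportional to \((n_{dt} + \alpha)(n_{tw} + \beta)/(n_t + V\beta)\)):

```python
import random
random.seed(1)

def lda_gibbs(docs, n_topics, alpha=0.6, beta=0.5, n_iter=200):
    """Collapsed Gibbs sampling for LDA on lists of word tokens."""
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    z = [[random.randrange(n_topics) for _ in doc] for doc in docs]
    n_dt = [[0] * n_topics for _ in docs]                    # doc-topic counts
    n_tw = [{w: 0 for w in vocab} for _ in range(n_topics)]  # topic-word counts
    n_t = [0] * n_topics                                     # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            n_dt[d][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]   # remove this word's current assignment
                n_dt[d][t] -= 1; n_tw[t][w] -= 1; n_t[t] -= 1
                weights = [(n_dt[d][k] + alpha) * (n_tw[k][w] + beta)
                           / (n_t[k] + V * beta) for k in range(n_topics)]
                t = random.choices(range(n_topics), weights)[0]
                z[d][i] = t   # record the resampled topic
                n_dt[d][t] += 1; n_tw[t][w] += 1; n_t[t] += 1
    return z, n_dt, n_tw

docs = [["economy", "jobs", "economy"], ["water", "energy", "water"]]
z, n_dt, n_tw = lda_gibbs(docs, n_topics=2)
```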
Correlated Topic Model (CTM)
Following closely from LDA, the CTM (Lafferty and Blei 2005) additionally allows for the modelling of correlated topics. Such topic correlations are introduced via the inclusion of the multivariate normal (MultNorm) distribution, with a length-t vector of means \(\mu\) and a t \(\times\) t covariance matrix \(\Sigma\), where the resulting values are then mapped into probabilities by passing through a logistic (log) transformation. Comparing Figure 5 and Figure 6, the nuance between LDA and CTM is highlighted using a light-grey colour, where the discrepancy between the models comes about from replacing the Dirichlet distribution (which involves the implicit assumption of independence across topics) with the logistic-normal distribution (which explicitly enables topic dependency via a covariance structure) for generating document-topic probabilities. The other generative processes previously outlined for LDA are retained and repeated for CTM. Given this additional model complexity, the more convoluted mean-field variational inference algorithm is employed for CTM-parameter estimation, which necessitates many iterations for optimization purposes. CTM is consequently computationally more expensive than LDA. Though, this snag is far outweighed by the procurement of richer topics with overt relationships acknowledged between them.
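The logistic-normal draw that distinguishes CTM from LDA can be sketched as follows (the \(\mu\) and \(\Sigma\) values are illustrative; the off-diagonal entry encodes a positive correlation between two topics that a Dirichlet could not express):

```python
import numpy as np

rng = np.random.default_rng(7)

def logistic_normal_topic_proportions(mu, sigma):
    """CTM's replacement for the Dirichlet: draw eta ~ N(mu, sigma), then
    map it onto the simplex with the logistic (softmax) transformation."""
    eta = rng.multivariate_normal(mu, sigma)
    exp_eta = np.exp(eta - eta.max())   # numerically stable softmax
    return exp_eta / exp_eta.sum()

mu = np.zeros(3)
sigma = np.array([[1.0, 0.8, 0.0],   # topics 1 and 2 positively correlated
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
theta = logistic_normal_topic_proportions(mu, sigma)
```

Over repeated draws, the proportions of topics 1 and 2 rise and fall together, which is the behaviour CTM exploits.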
Author Topic Model (ATM)
ATM (Rosen-Zvi et al. 2012) extends LDA via the inclusion of authorship information with topics. Again, inspecting Figure 5 and Figure 7, the slight discrepancies between these two models are accentuated with the light-grey colour. Here, for each word w in the document d, an author a is sampled uniformly (Uni) at random. Each author is associated with a distribution over topics (\(\psi\)) sampled from a Dirichlet prior \(\alpha\). The resultant mixture weights corresponding to the chosen author are used to select a topic t, after which a word w (or sentence) is generated according to the topic-word distribution \(\phi\) (drawn from another Dirichlet prior \(\beta\)) corresponding to that chosen topic t. Therefore, through the estimation of the \(\psi\) and \(\phi\) parameters, not only is information obtained about which topics authors generally relate to, but also a representation of the document contents in terms of these topics.
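The ATM generative story can be sketched as follows (the \(\psi\), \(\phi\), author, and vocabulary values are hypothetical, chosen only to make the sampling steps concrete):

```python
import numpy as np

rng = np.random.default_rng(3)

def atm_generate(doc_authors, n_words, psi, phi, vocab):
    """Generate one document under the ATM story: per word, pick an author
    uniformly at random, draw a topic from that author's psi, then draw the
    word from the topic-word distribution phi."""
    words, assignments = [], []
    for _ in range(n_words):
        a = str(rng.choice(doc_authors))        # Uni over the document's authors
        t = rng.choice(len(psi[a]), p=psi[a])   # topic ~ Mult(psi_a)
        w = rng.choice(len(vocab), p=phi[t])    # word ~ Mult(phi_t)
        words.append(vocab[w])
        assignments.append((a, t))
    return words, assignments

vocab = ["economy", "jobs", "water", "energy"]
psi = {"author1": [0.9, 0.1], "author2": [0.2, 0.8]}   # hypothetical psi
phi = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]    # hypothetical phi
words, assignments = atm_generate(["author1"], 10, psi, phi, vocab)
```

With a single author per SONA speech, the uniform author draw is trivial, and each president's \(\psi\) directly summarizes their personal topic agenda.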
Sentiment analysis
AFINN
The R-based \(\texttt{AFINN}\) lexicon scores words across a range spanning from the value of -5 to +5. Intuitively, words scored closer to the lower-boundary value relate to more negative sentiment, and in contrast higher positive sentiment is revealed if rather closer to the upper-boundary value (Silge and Robinson 2017).
Bing
Unlike \(\texttt{AFINN}\) , the R-based \(\texttt{bing}\) lexicon does not provide sentiments via some scoring system. Instead, it simply assigns a binary label of a word being interpreted as either positive or negative (Silge and Robinson 2017).
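The difference between the two lexicons can be made concrete with miniature stand-ins (the real \(\texttt{AFINN}\) and \(\texttt{bing}\) lexicons are far larger; the words and scores below are a hypothetical subset):

```python
# Hypothetical mini-lexicons mimicking AFINN (integer scores in [-5, 5])
# and bing (binary positive/negative labels).
AFINN_MINI = {"progress": 2, "support": 2, "improve": 2,
              "corruption": -3, "crime": -3, "violence": -3}
BING_MINI = {"progress": "positive", "support": "positive", "improve": "positive",
             "corruption": "negative", "crime": "negative", "violence": "negative"}

def net_sentiment(tokens):
    """Net sentiment under each lexicon: AFINN sums its word scores;
    bing counts positive words minus negative words."""
    afinn = sum(AFINN_MINI.get(w, 0) for w in tokens)
    bing = sum({"positive": 1, "negative": -1}.get(BING_MINI.get(w), 0)
               for w in tokens)
    return afinn, bing

tokens = "we will improve support and fight crime and corruption".split()
scores = net_sentiment(tokens)  # → (-2, 0)
```

The same sentence can thus be mildly negative under AFINN-style weighting but neutral under a binary count, which is one reason the two lexicons' trajectories can differ in magnitude while agreeing in shape.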
Exploratory Data Analysis
From Figure 8, it is evident that the word “government” is the most frequently referenced across all SONA speeches. This dominance draws upon the emphasized importance of this authority body, which is integral to the governance of SA. The frequent usage of the words “people” and “public” indicates a sense of inclusivity where the idea of togetherness is implicitly suggested. Other words, such as “development” and “new”, are indicative of ideas of growth and renewal. Lastly, a sense of security and safety is provided with the recurring use of the word “ensure”.
After faceting the most frequent words by president, as displayed in Figure 9, there are some slight nuances noted. For instance, former president de Klerk used words which were emblematic of the political paradigm shift that occurred during the time of his term. Words such as “transitional”, “constitutional”, and “constitution” reflect the country’s progression from an exclusive, segregated state to a more inclusive, democratic one. This political and legal reform directed towards achieving societal equality is further underscored by the words “parties”, “party”, and “election”. The pivotal role of proper partnerships being formed, which would have further aided in maintaining this change, is foregrounded with the word “alliance”.
Similarly, this idea of unity has also been foregrounded in the other five presidents’ speeches with the commonly shared word “people”. Though, unlike de Klerk, the other former presidents (Mandela, Mbeki, Motlanthe, Zuma) and current president (Ramaphosa), seem to similarly place more focus on the explicit communication of policies and vision (“development” and “work”) and the establishment of a sense of responsibility and accountability on their part as president (“government” and “ensure”). Some minor distinctions between these aforementioned presidents can be made. Mandela, Mbeki, and Motlanthe, for example, seemed to draw more attention to “society” or “social” progress, whilst Zuma and Ramaphosa appeared to place more prominence on “economic” progress.
Sentiment Analysis Results
Comparing Figure 10 (a) and Figure 10 (b), it is evident that there is no obvious, overt difference in the computed net sentiment scores, which are overall positive, across time and presidents for the two different lexicons. Any slight variation between \(\texttt{AFINN}\) and \(\texttt{bing}\) is most likely attributed to the lexicons’ varying scales (-5 to +5 versus -1/+1, respectively). Hence, any sentiment derived from the former lexicon might be slightly more exaggerated in nature compared to the latter lexicon. This is noted when checking the \(y\)-axis ranges of the sentiment scores, which reach a maximum of 600 for the \(\texttt{AFINN}\) lexicon and only 300 for the \(\texttt{bing}\) lexicon.
Across both lexicons, from de Klerk to Mbeki’s presidential terms, positive sentiment seems to steadily rise. Though, after a peak of high, positive sentiment scores from Mbeki’s SONA speeches, there is a slight decline in this overall positivity. This is especially present throughout Zuma’s presidential term.
From inspecting Figure 11, it is apparent that the relative trajectory of underlying emotion is generally similar for each president. The peaks and troughs in sentiment prevalent for both \(\texttt{bing}\) and \(\texttt{AFINN}\) lexicons occur at approximately the same sentences in the respective presidents’ speeches. Though, there is some stark contrast found between the two lexicons when comparing Zuma’s speeches: for this president, the negative falls and positive rises are more exaggerated for the \(\texttt{AFINN}\) compared to the \(\texttt{bing}\) lexicon. Additionally, the sentiment pattern of Mbeki’s speeches again seems more skewed to the positive side, with more frequent extreme rises to high sentiment-score values across sentences. It is also again seen that more negative sentiment is expressed in Zuma’s speeches, given the more dominant dips. For Ramaphosa, there appears to be more of a balance between positive and negative sentiment: there are no extreme, outlying rises/falls; rather, a more consistent sawtooth-like pattern is prominent.
As with the general sentiment-score trajectories, there is considerable overlap between the two lexicons when comparing the specific words which contribute to the positive and negative sentiments, as displayed in Figure 12. The same seven (out of the top ten) words commonly contribute to negative sentiment (“corruption”, “crime”, “violence”) and to positive sentiment (“improve”, “support”, “progress”) across both lexicons, albeit the extent of these words’ contributions to the respective sentiments varies slightly. Some unique, independent words also add to the negative sentiment for the \(\texttt{AFINN}\) lexicon (“problems”, “unemployment”) and the \(\texttt{bing}\) lexicon (“issues”). Likewise, there are distinctive words for the former lexicon (“growth”, “ensure”, “great”) and the latter lexicon (“well”) attributed to positive sentiment.
After faceting Figure 12 by president, as presented in the sub-plots of Figure 13, essentially no variability is indicated in terms of word uniqueness and contribution magnitude: for both lexicons, the same set of words adds the same amount to each sentiment. Furthermore, commonalities between words contributing to the sentiments are evident between the five presidents after de Klerk. Negative-sentiment words like “unconstitutional”, “deprive”, “discrimination”, and “boycott” and positive-sentiment words such as “proud”, “succeeded”, and “peaceful” feature only in de Klerk’s speeches. All of these words seem to directly relate to the change in political context during de Klerk’s term. Meanwhile, the words prevailing in the other five presidents’ speeches appear to foreground the continuation of this changed political climate with positive-sentiment words like “improve”, “better”, “freedom”, and “peace”. Additionally, the shared negative-sentiment words like “crime”, “corruption”, and “poverty” foreground the commonality of perpetuating problems that became pronounced throughout all presidential tenures after de Klerk.
Topic modelling Results
LSA
In the micro-context (i.e., word tokenization) of LSA implementation, the maximum coherence score of approximately -1.5 seen in Figure 14 indicates that three topics are optimal when utilizing the tf-idf approach. In contrast, there is no discernible difference in the coherence scores across a range of topic numbers when instead using the BoW approach. Hence, for comparative purposes, three topics are also chosen as best in this instance.
Considering Figure 15, a different overarching focus seems to come to the fore contingent on whether the BoW or tf-idf approach is examined. With respect to the tf-idf-related corpus, the topics appear concentrated on current issues. The first topic centers on a sole problem, particularly that of common (“compatriots”) “pandemic” preparedness (“plan”). The second topic broadens to other issues contextual to the country, ranging from energy (“eskom”) problems (“loadshedding”) to the “covid” “pandemic” to governmental corruption (state “capture”). Most words contained in the third topic already feature in topics one or two. Given this lack of unique distinction, topic three is said to simply be a synthesis of the two aforementioned topics. In comparison, the BoW-related corpus alludes to a different array of topics: the first being societal collectives (“people”, “public”, “us”); the second encompassing structural resources, suggested with words like “infrastructure”, “investment”, and “energy”; and the third, sustainability, relating to the prospect of long-term (“years”) endurance (“continue”) across these collectives and structures.
Unlike within the micro-context, LSA applied in the macro-context (i.e., sentence tokenization) indicates more variability in the choice of the optimal topic number for both BoW and tf-idf approaches. Due to the excessive jumps between high and low coherence scores across the topic amounts seen in Figure 16, it is opted to choose the lowest best number. This choice aims to limit potential theme-overlapping (i.e., many common words shared across topics) and allow for more conciseness.
When instead tokenized by sentences, there is poor topic distinction for the BoW-related corpus. As seen in Figure 17, both topics produced signal governmental structures, given the overlap in words (“president”, “members”, “deputy”, “leaders”). Contrastingly, the tf-idf-related corpus allows for more distinguishable topics, with one signifying the structures (“programme”) put in place by “government” for “social” and “economic” “development”, and the other suggesting a sense of gratitude (with the dominant weighting of the word “thank”) and some reformation indicators like “opportunity” and “growth”.
pLSA
As previously found with the implementation of LSA, there is essentially no variability in the coherence scores when utilizing the BoW approach for the word-tokenized application of pLSA. Though, like with LSA, there are fluctuating changes in the coherence scores across topic numbers when considering the tf-idf approach. Again, it is opted to take the minimum-best topic number (based on tf-idf) to apply in both approaches, amounting to five in this case.
For the word-tokenized speeches within the BoW framework, the first four topics reflect four different facets of governance: the processes (“work”, “development”) established by institutions which direct communities (“public”, “people”). Emphasis on the arrangement and form of “government” is made via explicit reference to the “system” and “sector” of “service”, thereby denoting structural governance. Then, issues that affect the order and well-being of the “nation” and “society” in general, like “crime”, allude to ideas of social governance. Economic governance forms the third topic with the mention of “business”, “economy”, “work”, “job”, and “investment”, whilst the reference to the resource “water” produces the fourth topic and facet of governance. The re-appearance of the words “service” and “sector” from topic one and “business” and “investment” from topic three means the fifth topic can be described as a combination of both structural and economic governance.
In comparison, the tf-idf framework constructs completely different topics which are unrelated to governance. Firstly, the pandemic (“covid”) and the resultant strain on resources (“vaccine”, “energy”, “rand”, “emergency”) feature in topic one. Ideas of prosperity (“happiness”, “dream”, “triumph”) and resolution (“question”, “sense”, “reconstruction”) then form the second and third topics, respectively. The last two topics reference sport (“cricket”, “soccer”, “bafana”) as well as the idea of associated support (“compatriot”) or, alternatively, the effect on the economy (“rand”, “jobcreation”).
Again, like in LSA, the coherence scores oscillate across a topic-number range when using sentence-tokenized speeches for pLSA. For both corpus approaches, the first peak is reached at four topics, as rendered in Figure 20.
From Figure 21, some slight similarity in topic type compared to the word-tokenized speeches is seen, specifically that of governance facets. Structural and social governance re-appear as the second and fourth topics for the BoW-related corpus, respectively, whilst economic governance is again seen for the tf-idf-related corpus, either combined with resource governance (topic one) or growth (topic four). Governmental structures, previously described as a topic in the sentence-tokenized LSA implementation, also appear here again. Evidently, there is some topic correspondence across the different tokenized corpora and different topic-modelling approaches.
LDA
A systematic hyperparameter grid search was conducted, spanning \(\alpha\) values from 0.1 to 1 in increments of 0.1 and \(\beta\) values from 0.1 to 1 in increments of 0.2, with the objective of determining the optimal combination for coherence score maximization across a topic range of two to ten. This empirical approach revealed, through the application of contour plots to both word-and-sentence tokenized speeches (Figure 22), a distinct pattern: the coherence score remained largely unaffected by variations in the \(\beta\) parameter, yet exhibited a pronounced sensitivity to alterations in the \(\alpha\) parameter.
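The grid search described above can be sketched as a simple exhaustive loop over all hyperparameter combinations (the `fit_and_score` argument is a placeholder for fitting an LDA model and returning its coherence score):

```python
import itertools

def grid_search(corpora, topic_range, alphas, betas, fit_and_score):
    """Exhaustive search over (corpus, topics, alpha, beta) combinations,
    keeping the combination with the maximum coherence score."""
    best = None
    for combo in itertools.product(corpora, topic_range, alphas, betas):
        score = fit_and_score(*combo)
        if best is None or score > best[1]:
            best = (combo, score)
    return best

alphas = [round(0.1 * i, 1) for i in range(1, 11)]  # 0.1 to 1.0 in steps of 0.1
betas = [0.1, 0.3, 0.5, 0.7, 0.9]                   # 0.1 to 0.9 in steps of 0.2

# A dummy scorer (not a real model fit) exercises the search mechanics:
dummy = lambda corpus, k, a, b: -abs(k - 9) - abs(a - 0.6)
(best_corpus, best_k, best_a, best_b), best_score = grid_search(
    ["BoW", "tf-idf"], range(2, 11), alphas, betas, dummy)
```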
Examination of Table 1, presenting a subset of the grid-search results for word-tokenized speeches, further corroborates these findings. The values therein demonstrate a tendency for higher \(\alpha\) values to consistently yield maximum coherence scores, irrespective of the \(\beta\) value. This table also highlights the superiority of the BoW-related corpus over the tf-idf-related corpus in achieving an optimal hyperparameter combination, particularly with a topic number of nine. Parallel results are observed in Table 2, where the employment of high \(\alpha\) values in conjunction with the BoW-related corpus, regardless of the value of \(\beta\), results in the most favourable coherence scores. In this case, a minimal amount of two topics is sufficient.
Table 2: Subset of the LDA hyperparameter grid-search results for sentence-tokenized speeches.

| Corpus | Topics | Alpha | Beta | Coherence |
|---|---|---|---|---|
| BoW | 2 | 0.90 | 0.90 | -18.98 |
| BoW | 2 | 0.80 | 0.50 | -18.98 |
| BoW | 2 | 0.70 | 0.10 | -18.98 |
| BoW | 2 | 0.70 | 0.30 | -18.98 |
| BoW | 2 | 0.70 | 0.50 | -18.98 |
| tf-idf | 2 | 0.40 | 0.10 | -20.56 |
| tf-idf | 2 | 0.40 | 0.30 | -20.56 |
| tf-idf | 2 | 0.40 | 0.50 | -20.56 |
| tf-idf | 2 | 0.40 | 0.70 | -20.56 |
| tf-idf | 2 | 0.40 | 0.90 | -20.56 |
Interactive visualizations of the LDA results are produced using \(\texttt{pyLDAvis}\). The left panel of this tool presents an inter-topic distance map, offering a comprehensive view of the topic distribution and relationships, where multi-dimensional scaling (MDS) is applied to project these relations onto a two-dimensional plane. The size of each topic circle in this map is directly proportional to its prevalence. The right panel provides a horizontal bar chart delineating the most salient words for each topic, thereby facilitating a deeper understanding of the associated meaning. The juxtaposition of corpus-wide frequencies (light-blue bars) against topic-specific frequencies (red bars) for each term allows for a thorough exploration of topic-term relationships (Sievert and Shirley 2014).
Further analytical depth is achieved through the application of the relevance metric, which leverages a weight parameter \(\lambda\) (calculation thereof shown in the right-panel footnote) to balance the probability of a word within a topic relative to its lift (a measure of the word’s frequency within a topic relative to its marginal probability across the corpus). This measure is instrumental in distinguishing between terms that are frequent yet not exclusive to a topic (high \(\lambda\) values, near one) and those that are rare but highly indicative of a topic (low \(\lambda\) values, near zero) (Sievert and Shirley 2014). For more interpretable results, the value of \(\lambda\) can be treated as another hyperparameter to tune. Such an exercise is not executed here; rather, \(\lambda\) is left modifiable so that word rankings can be altered to aid topic interpretation. Given this, in addition to accounting for the widths of the red and light-blue bars for a given word, one can determine whether a word is highly relevant to the selected topic due to its lift (high ratio of red to light-blue) or its probability (absolute width of red).
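The relevance metric of Sievert and Shirley (2014) is a convex blend of a term's log-probability within a topic and its log-lift. A minimal sketch, using hypothetical probabilities for a corpus-wide frequent term versus a rare but topic-exclusive one:

```python
import math

def relevance(p_w_topic, p_w, lam):
    """relevance(w, t | lambda) =
    lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))."""
    return lam * math.log(p_w_topic) + (1 - lam) * math.log(p_w_topic / p_w)

# Hypothetical values: a term frequent across the corpus but barely
# concentrated in the topic, versus a rare, topic-exclusive term.
common_term = relevance(p_w_topic=0.05, p_w=0.04, lam=0.1)
exclusive_term = relevance(p_w_topic=0.01, p_w=0.001, lam=0.1)
# At a low lambda the lift term dominates, so the rare-but-exclusive
# term outranks the merely frequent one; at lambda = 1 the ranking
# reduces to within-topic probability and reverses.
```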
pyLDAvis for Words
When applied to the word-tokenized BoW corpus, the LDA results reveal several distinct thematic clusters. Assuming a \(\lambda\) value of 0.1, topic one, characterized by terms like “mining” and “alliance,” is acknowledged as the most prevalent (largest circle), suggesting a focus on mining-activity relations. In contrast, topic nine (smallest circle) is associated with “municipality” improvement (“refurbishment”). This thematic diversity, where topic one alludes to a national economic activity while topic nine encompasses more of a local governing body, is further reflected in the spatial distribution of these two circles in the left panel. Contrastingly, topics five to eight are highly inter-connected, given their circles’ close proximity. Topic four refers to general economic activity, with top words relating to monetary value (“millions”) and other economic terms like “spending” and “operations”. Topics one and four both underline some part of the economy, the former industry-specific and the latter more generalized; this underlying commonality is demonstrated by the general closeness of their two circles. Finally, the topic-two circle is the furthest from all other topic circles, as it refers to an abstraction, unity (“compatriots”), which is not an idea featured among the other topics.
pyLDAvis for Sentences
In the case of the sentence-tokenized BoW corpus, the two topics seem to have minimal interrelation, at least as suggested by their circles being quite separate in space. However, across a range of \(\lambda\) values, establishing differentiable underlying themes for these two topics proves challenging, given that the same top words (like “railways”) are generally present for both.
CTM
When accounting for potential correlation between topics, Figure 23 (a) (for sentence-tokenized speeches) and Figure 23 (b) (for word-tokenized speeches) indicate that the optimal number of topics for a BoW-related corpus is three and two, respectively.
As first found in pLSA, the idea of governance also features among the CTM topics for both token-type corpora. However, instead of spreading the different facets across five topics, the CTM combines economic and social governance (topic one for words), along with structural governance (topic two for sentences), into one topic. Moreover, as seen before in LSA, the topic of societal collectives appears again. The only distinct result is the first topic of the sentence-tokenized speeches being dominated by explicit, economy-related words, through overt references to monetary values (“million”, “billion”) and currency (“r” for Rands).
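The correlation that CTM captures, and LDA's Dirichlet prior cannot, stems from its logistic-normal prior over document-topic proportions. A minimal sketch, with a hypothetical covariance that positively correlates two governance topics (e.g. economic and structural) while leaving a third independent:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ctm_theta(mu, sigma, n_docs, rng):
    """Draw document-topic proportions from a logistic-normal prior:
    Gaussian draws pushed through a (numerically stable) softmax."""
    eta = rng.multivariate_normal(mu, sigma, size=n_docs)
    e = np.exp(eta - eta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

mu = np.zeros(3)
# Hypothetical covariance: topics 0 and 1 co-occur, topic 2 is separate.
sigma_corr = np.array([[1.0, 0.8, 0.0],
                       [0.8, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
theta_corr = sample_ctm_theta(mu, sigma_corr, 5000, rng)
theta_indep = sample_ctm_theta(mu, np.eye(3), 5000, rng)

corr = np.corrcoef(theta_corr[:, 0], theta_corr[:, 1])[0, 1]
indep = np.corrcoef(theta_indep[:, 0], theta_indep[:, 1])[0, 1]
# Correlated Gaussian components yield topic proportions that rise and
# fall together, unlike the independent (Dirichlet-like) baseline.
```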
ATM
As with CTM, the coherence plots presented in Figure 25 suggest that three and two topics yield the maximum coherence score for the BoW-related corpus for the sentence-tokenized and word-tokenized speeches, respectively.
The topics suggested by the ATM are the most disparate from those of the other topic models implemented. When author-topic information is incorporated, topics shift away from being mere reporting overviews of how governance was achieved economically, structurally, and socially, and of how contemporary contexts (like the pandemic, power problems, or pivotal sporting events) were confronted. Instead, particular presidents’ agendas become more apparent. For instance, Mbeki’s and Zuma’s agendas are entirely social cohesion and confidence (“constitution”, “democracy”, “better”) when considering sentence-based topics. Most notably, for all presidents within the word-based-topic context, there is an equal emphasis placed on the general national agenda (like “elections”) and on more economic (“rand”, “mining”) and reformation-related (“reform”) priorities. Moreover, in general, the ATM is able to attribute the relevant speech or speeches to the correct author (i.e., president).
| President | Speech IDs | Word-Based Topics | Sentence-Based Topics |
|---|---|---|---|
| Mandela | 4, 16, 0, 23, 31, 21, 15 | General national agenda (0.52); Economic reform agenda (0.48) | Social cohesion and confidence (1.00) |
| Ramaphosa | 7, 26, 3, 24, 8, 1, 35 | General national agenda (0.30); Economic reform agenda (0.70) | Social cohesion and confidence (0.66); Land and people (0.34) |
| Mbeki | 9, 14, 22, 34, 32, 28, 29, 2, 12, 25 | General national agenda (0.33); Economic reform agenda (0.67) | Social cohesion and confidence (1.00) |
| Zuma | 13, 5, 6, 20, 19, 30, 27, 33, 11, 10 | General national agenda (0.47); Economic reform agenda (0.53) | Social cohesion and confidence (1.00) |
| de Klerk | 17 | General national agenda (0.56); Economic reform agenda (0.44) | Social cohesion and confidence (0.02); Land and people (0.98) |
| Motlanthe | 18 | General national agenda (0.55); Economic reform agenda (0.45) | Social cohesion and confidence (0.17); Prospective planning (0.83) |
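The attribution behaviour described above can be illustrated with a simplified stand-in for ATM's probabilistic inference: given each author's word-based topic mixture from the table, assign a document to the author whose mixture lies closest to the document's inferred mixture. The query mixture below is hypothetical.

```python
import math

# Word-based author-topic mixtures from the table above
# (general national agenda, economic reform agenda).
author_theta = {
    "Mandela":   (0.52, 0.48),
    "Ramaphosa": (0.30, 0.70),
    "Mbeki":     (0.33, 0.67),
    "Zuma":      (0.47, 0.53),
    "de Klerk":  (0.56, 0.44),
    "Motlanthe": (0.55, 0.45),
}

def attribute(doc_mix, author_theta):
    """Return the author whose topic mixture is nearest (Euclidean
    distance) to the document's mixture - a toy proxy for ATM's
    author attribution, not the model's actual inference."""
    return min(author_theta, key=lambda a: math.dist(author_theta[a], doc_mix))

# A hypothetical speech inferred as 31% national agenda / 69% economic
# reform lands closest to Ramaphosa's mixture.
```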
Discussion and Conclusion
Sentiment analysis, via a comparison of the \(\texttt{AFINN}\) and \(\texttt{bing}\) lexicons, revealed no significant sentiment differences over time. The specific words contributing to these sentiments varied, however, especially when stratified by presidential tenure.
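The two scoring schemes differ in kind: \(\texttt{AFINN}\) assigns signed integer valences, while \(\texttt{bing}\) assigns binary polarity. A minimal sketch with toy mini-lexicons (a real analysis would load the full lexicons, e.g. via R's tidytext package):

```python
# Toy lexicons in the style of AFINN (integer valences) and bing
# (binary polarity); entries and scores here are illustrative only.
afinn = {"growth": 2, "hope": 2, "crisis": -3, "unemployment": -2}
bing = {"growth": "positive", "hope": "positive",
        "crisis": "negative", "unemployment": "negative"}

def afinn_score(tokens):
    """Sum of signed word valences (AFINN-style)."""
    return sum(afinn.get(t, 0) for t in tokens)

def bing_score(tokens):
    """Positive-minus-negative word count (bing-style)."""
    return sum(1 if bing[t] == "positive" else -1 for t in tokens if t in bing)

speech = "hope and growth after the crisis of unemployment".split()
# The same passage can score differently under the two schemes, since
# AFINN weights strong words (like "crisis") more heavily than bing.
```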
The implementation of four topic models (LSA, pLSA, LDA, and CTM) demonstrated a general consistency in themes, encompassing predominantly different governance dimensions. Notably, considerable similarity was evident between the findings of CTM and those of LSA/pLSA. The ATM, however, introduced a level of granularity that revealed more in-depth, abstract topics, through which a more nuanced differentiation between presidents and their personal visions became apparent. This finding highlights the relevance of employing advanced topic modelling to gain a deeper understanding of individual leadership styles and priorities.
Overall, the utility of employing NLP techniques to uncover sentimental and thematic patterns has been demonstrated not only in general, but also in a niche context like presidential communication. The findings emphasize the occasional need for more sophisticated methodologies (such as ATM) to uncover subtle sentiment and topic variations in political discourse. These results could potentially be refined by combining the sentiment analysis and topic modelling with expertise in political science and linguistics.